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Abstract 

The evolutionary origins of the multitude of duplicate genes in the plant genomes are still incompletely understood. To gain 
an appreciation of the potential selective forces acting on these duplicates, we phylogenetically inferred the set of metabolic 
gene families from 10 flowering plant (angiosperm) genomes. We then compared the metabolic fluxes for these families, 
predicted using the Arabidopsis thaliana and Sorghum bicolor metabolic networks, with the families' duplication 
propensities. For duplications produced by both small scale (small-scale duplications) and genome duplication (whole- 
genome duplications), there is a significant association between the flux and the tendency to duplicate. Following this global 
analysis, we made a more fine-scale study of the selective constraints observed on plant sodium and phosphate transporters. 
We find that the different duplication mechanisms give rise to differing selective constraints. However, the exact nature of 
this pattern varies between the gene families, and we argue that the duplication mechanism alone does not define 
a duplicated gene's subsequent evolutionary trajectory. Collectively, our results argue for the interplay of history, function, 
and selection in shaping the duplicate gene evolution in plants. 
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Introduction 

The contribution of gene duplication to evolution has long 
been a topic of interest (Taylor and Raes 2004), but in the last 
1 0 years there has been a resurgence of interest in the varied 
fates of such duplications (Zhang et al. 2002; Kondrashov 
and Koonin 2004; Adams and Wendel 2005; Aury et al. 
2006; Rodriguez et al. 2007; Barker et al. 2008; Liang 
et al. 2008; Ha et al. 2009; Innan and Kondrashov 2010; 
Ramsey 2011). Among those fates, the important roles 
played by genetic drift and simple changes in the gene "dos- 
age" are increasingly appreciated. In several contributions, 
Lynch et al. have argued that the relatively small population 
sizes of multicellular eukaryotes could result in the fixation 
of many gene duplications through nonadaptive processes 
(Force et al. 1999; Lynch and Conery 2003; Lynch 2007). 
These processes, of course, still occur under the overall um- 
brella of natural selection. For instance, selection may act on 
gene dosage in one of two ways. First and most obviously, 
duplication of a gene may increase the rate of transcription 



and hence the translation of the encoded protein, increasing 
its abundance. We have previously referred to this possibility 
as a selection on "absolute" dosage (Bekaert et al. 201 1). If 
a higher protein expression is selectively beneficial, we expect 
copy number polymorphisms will be fixed (Blanc and Wolfe 
2004a; Kondrashov and Kondrashov 2006). The second pos- 
sibility is that of a selection on the "relative dosage," where an 
event affecting one of several genes that have coevolved to- 
gether (i.e., a single gene duplication or differential paralog 
loss after polyploidy) introduces selective costs. This concept 
is known as the "dosage balance hypothesis" (Freeling 2009) 
and has been explored by a number of authors (Papp et al. 
2003; Freeling and Thomas 2006; Birchler and Veitia 2007; 
Edger and Pires 2009). Here, we focus on the role of absolute 
dosage selection in determining duplicate fates. 

As the first complete genome sequences became available, 
their patterns of gene duplication were explored to under- 
stand, among other questions, the role of natural selection 
in duplicate gene fixation (Lynch and Conery 2000; Gu 
et al. 2002; Wagner 2002; Gu et al. 2003). Those duplications 
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had multiple origins, including whole-genome duplications 
(WGDs or polyploidy), as well as segmental, tandem, and 
retro-duplications (referred to here collectively as small-scale 
duplications or SSDs; Cannon et al. 2004; Thomas et al. 
2006; Freeling 2009). The preponderance of polyploids among 
angiosperms (Wendel 2000) has led plant biologists to focus 
on understanding the patterns of the duplicate gene loss and 
the retention following WGD events (Bowers et al. 2003; Blanc 
and Wolfe 2004a; Blanc and Wolfe 2004b; De Bodt et al. 
2005; Maere et al. 2005; Pfeil et al. 2005; Sterck et al. 
2005; Cui et al. 2006; Freeling and Thomas 2006; Paterson 
et al. 2006; Schranz and Mitchell-Olds 2006; Town et al. 
2006; Tuskan et al. 2006; Tang, Wang, et al. 2008; Barker 
et al. 2009; Edger and Pires 2009; Soltis et al. 2009; Wood 
et al. 2009; Duarte et al. 2010; Coate et al. 2011; 
Jiao etal. 201 1; Schnableetal. 201 1). In this work, we consider 
gene families with members derived from both SSD and WGD. 
These families are inferred from 10 angiosperm genomes: 
seven dicots (Arabidopsis, papaya, soybean, Medicago trunca- 
tula, poplar, peach, and grape) and three monocots (Brachy- 
podium distachyon, rice, and sorghum). 

Of course, the taxa examined have a long history of poly- 
ploidy. Within the eudicots, the oldest genome duplication 
event, y, was an ancient hexaploidy that characterizes the 
Rosidae (sensu Soltis et al. 201 1), if not the core eudicots 
(Gunneridae sensu Jaillon et al. 2007; Lyons, Pedersen, Kane, 
Alam, et al. 2008; Lyons, Pedersen, Kane, Freeling, et al. 
2008; Ming et al. 2008; Freeling 2009; Argout et al. 2011; 
Jiao etal. 2011; Shulaev etal. 2011; Soltis etal. 201 1). Com- 
parative genomics suggest that the lineage leading to poplar 
(Populus trichocarpa) underwent an additional WGD event, 
whereas that of the thale cress (Arabidopsis thaliana) had 
two: p and a. That these two duplications are independent 
is suggested by their absence in both grape {Vitis vinifera) 
and papaya (Carica papaya; fig. 1; Jaillon et al. 2007; Ming 
et al. 2008; Tang, Wang, et al. 2008; Freeling 2009). Analysis 
of the nonsynonymous substitution rates in the soybean (G/y- 
cine max) genome has revealed two WGD events post- 
WGD-y: one shared with peanut (Arachis hypogaea), a basal 
legume, and a more recent soybean-specific duplication 
(Bertioli et al. 2009; Schmutz et al. 2010). The 3:1 ratio of 
grape to rice (Oryza sativa) genomic segments suggests that 
the y paleohexaploidy is dicot-specific (Jaillon et al. 2007). 
However, cereal monocots also have a WGD event, p, basal 
to their radiation (Paterson et al. 2004); rice, sorghum (Sor- 
ghum bicolor) and purple false brome (B. distachyon) show 
no evidence of further WGD events (Throude et al. 2009; 
Vogel et al. 2010). 

There is mounting evidence that the gene duplications 
created by WGD and by SSD differ in their ultimate fates 
(Seoigheand Wolfe 1999; Pappetal. 2003; Blanc and Wolfe 
2004a, 2004b; Cannon et al. 2004; Aury et al. 2006; 
Thomas et al. 2006; Hakes et al. 2007; Conant and Wolfe 
2008; Freeling 2008; Edger and Pires 2009; Freeling 2009; 



Coate et al. 201 1). To cite just one example (relevant to this 
work), Maere et al. (2005) found that the ion transporters 
were overretained after WGD but underretained following 
SSD. The study of Arabidopsis WGDs by Blanc and Wolfe 
(2004a) reached similar conclusions but also found that 
genes involved in phosphate metabolism were significantly 
overretained following the recent WGD-a. 

We are interested in whether the dosage effects are 
a strong predictor of duplicate retention, and here, we have 
taken both a "high level" phylogenomic approach and 
a "low-level" single-gene approach to look for evidence 
of such selection. Our first analysis extends our previous work 
in Arabidopsis, where we found an association between met- 
abolic flux and some, but not all, of the Arabidopsis WGDs 
(Bekaert et al. 201 1). Specifically, we hypothesize that genes 
in families with high flux will be, on average, over duplicated. 
Given that we have previously found significant differences in 
duplication propensity between cellular compartments 
(Bekaert etal. 201 1; Hudson and Conant 201 1), we also test 
for a relationship of duplicability and compartment. Addition- 
ally, we hypothesized that the WGD-produced and SSD- 
produced gene duplications will differ in their postduplication 
selective constraints. We evaluate this by narrowing our focus 
to a group of ion transporters. Such transporters have been 
found to have an outsized influence on the metabolic flux 
(Kacser and Burns 1981 notwithstanding; Brown et al. 
1998; Pritchard and Kell 2002). Furthermore, their evolution- 
ary behavior is distinct from other metabolic genes following 
both SSD and WGD (Lin and Li 2010; Bekaert and Conant 
201 1). Given the complexity of plant genome evolution, lim- 
iting our analysis to single gene families also has the advan- 
tage of allowing us to carefully distinguish WGD from SSD. 

Materials and Methods 

Estimation of Metabolic Flux 

As previously described (Bekaert et al. 201 1), we used the 
Systems Biology Research Tool v2.0.0 (Wright and Wagner 
2008) to perform flux-balance analysis on the A. thaliana 
and 5. bicolor metabolic networks (de Oliveira Dal'Molin 
et al. 2010a, 2010b). We estimated the maximal biomass 
production possible under photosynthetic conditions (a 
fixed level of photon import allowed, sugar imports forbid- 
den) for both A. thaliana and S. bicolor networks (Pearson's 
correlation of flux r = 0.638, P < 10~ 15 ) and nonphotosyn- 
thetic conditions (photon import forbidden, fixed sugar 
imports allowed) for the A. thaliana network. Flux-balance 
analysis was also run by limiting the biomass and maximizing 
either photon import or sugar import (Bekaert et al. 201 1), 
the results were similar and qualitatively the same. Because 
the distinctions between the two networks are in the pho- 
tosynthetic reactions and because the sorghum metabolic 
network is derived from the Arabidopsis one, inclusion of 
the sorghum root data would be less informative and is 
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Fig. 1. — Plant species used in reconciling gene trees. The phylogenetic relationships (branch lengths are arbitrary) among these species have been 
described previously (Moore et al. 2007; Paterson et al. 2009). The histograms depict the number of duplications per gene family. Thus, on the x-axes is 
the number of duplicates observed in a family at that node in the tree (on a natural log scale). The y-axes are then the frequency of families with that 
number of duplications. The scale is consistent across histograms. Black circles indicate whole-genome duplication events. 



hence omitted. In each case, we also made every possible 
reaction knockout whereby a given reaction's flux is con- 
strained to zero and the remainder of the network is reopti- 



mized. After knockout, all fluxes were normalized by the 
value of the biomass flux. Then, for each reaction, we se- 
lected the observed maximum flux, across all conditions. 
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By doing so, we find what is essentially an upper bound on 
the flux of each reaction. It would obviously be desirable to 
also estimate the sensitivity of the network to changes in 
flux through each reaction. However, we do not have kinetic 
data for the entire network and these values cannot be es- 
timated with flux-balance analysis. Instead, we compared 
this maximal flux to the duplication status of each reaction 
node. In cases where there was more than one flux value 
associated with a gene family, all possible flux values for that 
family were used in our association analyses, meaning that 
large gene families will not tend to be biased toward high 
flux because they encompass more reactions. 

Gene Family Identification 

We used the list of A thaliana enzymes from the de Oliveira 
Dal'Molin et al. (201 0a) metabolic network to identify enzyme 
gene families in the genomes of 10 flowering plants (fig. 1; 
A thaliana, B. distachyon, C. papaya, G. max, M. trunculata, 
O. sativa, P. trichocarpa, Prunus persica, S. bicolor, and 
V. vinifera The Arabidopsis Genome Initiative 2000; Young 
et al. 2005; Ouyang et al. 2006; Tuskan et al. 2006; 
Jaillon et al. 2007; Ming et al. 2008; Paterson et al. 2009; 
Schmutz et al. 2010; The International Brachypodium 
Initiative 201 0; Jung et al. 2009; Vogel et al. 2010). Homol- 
ogous relationships were inferred using GenomeHistory 
(Conant and Wagner 2002), which calculated the nonsy- 
nonymous substitution rate (K a ) for all gene pairs with 
BLASTscores lower than 0.0001 . Gene families were iden- 
tified by single-linkage clustering with a cutoff in nonsynon- 
ymous divergence of K a < 0.20 for A thaliana/A. thaliana 
comparisons and K a < 0.30 for all other comparisons 
(Powell et al. 2008). Gene pairs with K a values below these 
thresholds were treated as nodes connected by an edge in 
the provisional gene family networks. These K a parameters 
were selected after analyzing the results of using different 
K a thresholds. For each threshold, we iteratively removed 
single edges from the provisional gene families. The chosen 
K a thresholds were the largest values that did not 
cause a noticeable change in the constituency of the provi- 
sional gene families when any single edge was removed (data 
not shown). Families with fewer than four member genes were 
excluded. We used these gene families to associate a gene tree 
with each gene in the S. bicolor metabolic network. 

Of the 138 pathways involved in Arabidopsis central me- 
tabolism, six contain no enzymes in the gene families we an- 
alyzed. Two of them (1,4-dichlorobenzene degradation and 
C21 -steroid hormone metabolism) contain enzymes only 
present in Arabidopsis. The transport of ot-D-glucose from 
the cytoplasm to the external cellular component contains 
a gene (AT5G 18880), which is unclearly annotated. The 
transport of citrate and nitrate and the biosynthesis of mono- 
terpenoid include four A thaliana genes (citrate: AT1 G02260, 
monoterpenoid: AT3G25830 and AT4G16730, and nitrate: 



AT5G 1 4570) that our pipeline split into gene families that were 
too small to analyze phylogenomically. 

Phylogenomics of Gene Families 

Multiple sequence alignments of the protein sequences for 
each gene family were computed with MUSCLE v3.6 (Edgar 
2004) using default parameters. Codon alignments were 
deduced from those alignments having 50 or more amino 
acids. We then inferred maximum likelihood gene trees 
using RAxML v7.0.4 (Stamatakis et al. 2008) with a general 
time-reversible model and discrete approximation of the 
gamma distribution (GTR + T). Confidence values were as- 
signed to the gene trees from 100 bootstrap replicates. A 
relatively limited number of replicates were computed be- 
cause we only wished to use these bootstrap statistics to 
identify nodes in the phylogeny with low support (<65%) 
prior to gene tree/species tree reconciliation. We thus recon- 
ciled all inferred gene trees with the species tree in figure 1 
(Moore et al. 2007; Wang et al. 2009). To do so, we used 
NOTUNG v2.6 (Chen et al. 2000) to infer the most parsimo- 
nious pattern of the gene duplication and loss. Gene tree 
nodes with less than 65% bootstrap support were treated 
as polytomies and allowed to rearrange in order to minimize 
the number of duplications and/or losses (in practice, choos- 
ing support value thresholds between 50% and 80% 
produced similar results; data not shown). Using these parsi- 
mony reconstructions, we calculated the number of duplica- 
tions (and number of duplications per species) for each gene 
tree. 

Manual Annotation of Transporter Gene Trees 

Coding sequences for the nine annotated PHT1 s, one PHT2, 
three PHT3s, six PHT4s, and eight NHXs of A thaliana were 
downloaded from TAIR (Swarbreck et al. 2008). A BLASTP 
search of the A thaliana genome with these 19 and 8 se- 
quences identified no further phosphate or sodium trans- 
porters in the genome. We then used BLASTP to search 
for ion transporter homologs in the genomes of papaya 
and poplar. We retained genes with BLAST E-values less than 
10~ 20 as putative members of a given transporter family. 
Our homology estimation procedure always placed genes 
from C. papaya and P. trichocarpa into only a single A thali- 
ana transporter family. Gene trees were constructed as de- 
tailed above. In the case of the NHXs, one A thaliana gene 
(At2g01980) aligned poorly with the other NHXs and was 
excluded from the alignment and gene tree. 

We manually assigned nodes in these phylogenies as ei- 
ther speciation or duplication events (fig. 2). Nodes connect- 
ing genes from the same species were labeled as duplication 
events where nodes connecting genes from different spe- 
cies were labeled speciation events. Because we were work- 
ing with only a handful of genes, it was possible to make 
a more accurate distinction between SSD and WGD genes 
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Fig. 2. — Ion transporter gene trees used in this study. Branches demarcating speciation events are colored orange, whole-genome duplication 
events green, and non-WGDs purple, (a) High-affinity phosphate transporters AtPHTI; 1 -AtPHTI; 9 with 16 P. trichocarpa and 7 C. papaya homologs. 
(b) Low-affinity phosphate transporters AtPHT2; 1 with 2 poplar and 1 papaya homologs. (c) Mitochondrial phosphate transporters AtPHT3; 1-ATPHT3; 
3 with 6 poplar and 1 papaya homologs. (d) Chloroplast phosphate transporters AtPHT4; 1-AtPHT4; 6 with 10 poplar and 6 papaya homologs. (e) 
Sodium ion transporters AtNHXI -AtNHX6, AtNHX8 with 6 poplar and 4 papaya homologs. 



for these transporters than was possible for the genome- 
scale analyses. Thus, whole-genome duplicates were inferred 
in cases where the paralogs fit into distinct paralogous syn- 
teny blocks from the Plant Whole Genome Duplication Data- 
base (PGDD; Tang, Bowers, et al. 2008). Nodes connecting 
gene paralogs that could not be assigned using the PGDD 
were inferred to be SSDs (fig. 2). These manual duplication 
or speciation designations agreed with the automatic as- 
sessments of NOTUNG. 

Selective Constraint Following Speciation and Duplication 
Events in Five Families of Ion Transporters 

The selective constraint (ratio of nonsynonymous substitu- 
tions to synonymous substitutions, i.e., KJK S ), for each gene 
tree was estimated by maximum likelihood under the MG/ 
GY94 codon model (Goldman and Yang 1994; Muse and 
Gaut 1 994): for details, see Conant et al. (2007). We tested 
three nested models of evolution: requiring all branches to 



have the same value of KJK S (R_Null), allowing different val- 
ues of KJK S for branches following a speciation node from 
those following a duplication node (R_Dupl), and a model 
with differing values of K a /K s for branches following speci- 
ation, whole genome, and small-scale duplications (R_WGD). 
We compared these three models with nested likelihood ratio 
tests and evaluated statistical significance using the x 2 distri- 
bution, knowing that R_WGD has one more free parameter 
than R_Dupl, which in turn has one more parameter than 
R_Null. 

Analysis of Constraints by Gene Ontology Slim 
Annotation 

Gene ontology slim (GO Slim) annotations were obtained 
for each A. thaliana gene from TAIR. GO Slim categories 
were further condensed (supplementary table 1, Supple- 
mentary Material online) and transferred to our gene fam- 
ilies. Spearman's rank correlations between the flux and 
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number of duplications in each gene family were calcu- 
lated in SAS (v9.2.1, Cary, NC) for all the cellular compart- 
ments and functions. Note that gene families could 
appear in more than one compartment or functional 
group. We applied a Bonferroni multiple-test correction 
equal to the number of either compartments or functional 
groups analyzed, resulting in the respective values of a, 
0.0055 and 0.0042. 

We also used the Wilcoxon rank test (SAS v9.2. 1 ) to ask 
if the number of duplications per gene family differed for 
each cellular compartment or function as compared with 
the reminder of the genome. We used the same Bonferroni 
multiple-test corrections as previously. 

Results 

Computing Gene Families and Flux Values 

We estimated the flux through each biochemical reaction in 
the Arabidopsis and sorghum metabolic networks using flux- 
balance analysis (Orth et al. 2010), maximizing the produc- 
tion of new cell mass for a fixed input of either light energy 
in both Arabidopsis and sorghum (in photosynthetic tissues) 
or carbohydrates for Arabidopsis (in nonphotosynthetic tis- 
sues, see Materials and Methods). We included the sorghum 
network to be sure that the differences in C3 and C4 pho- 
tosynthesis were not greatly biasing our results. 

Maximal flux values ranged from 0 to 3865120 (arbitrary 
flux-balance units) in the Arabidopsis leaf, from Oto 61 56740 
in the Arabidopsis root, and from 0 to 2560860 in sorghum, 
when the biomass production is maximized and scaled to 
1 000 units. We then coupled those data to a set of cross- 
genome gene families identified from the 10 plant genomes 
(Materials and Methods). The result was a set of 735 gene 
families with associated metabolic fluxes. Of these 735 gene 
families, 463 have absolute flux values greater than zero. 
These families vary in size from 4 to 306 genes. The number 
of non-null-flux values associated with each family ranges 
from 1 to 1 3, with 90% having only one associated flux value 
and only three having 10 or more flux values. Those three 
families function as ATP synthases, phospholipid transporters, 
and cellulose synthases (functional Gene Ontology annota- 
tion from TAIR; Swarbreck et al. 2008). The number of gene 
duplications per family varies from 0 to 210, with a mean of 
3.21 duplications per species. Reactions with no flux can 
result either from failure to include certain metabolites 
in the biomass reaction or from a reaction not being used 
in certain conditions. Because of the potential for error intro- 
duced by these two possibilities, we present our results both 
with and without null-flux reactions. 

Correlation Between Number of Duplications and 
Maximum Metabolic Flux 

The correlation between the number of duplications in 
a gene family and the maximal flux is positive and significant 



Table 1 

Correlations Between Duplication and Flux by Gene Family 



Excluding Null- 
All Flux Values Flux a 





r* 




r 


P 


Duplications per 


gene family 








All conditions 


0.245 


<10~ 15 


0.336 


<io- 15 


C3 leaves 


0.218 


<10~ 15 


0.328 


<10~ 15 


C4 leaves 


0.176 


<10~ 8 


0.218 


<io- 4 


Roots 


0.223 


<10" 15 


0.359 


<10~ 15 


Duplications per species per gene family d 






All conditions 


0.227 


<10~ 15 


0.306 


<10- 15 


C3 leaves 


0.203 


<10~ 15 


0.272 


<io- 14 


C4 leaves 


0.163 


<10~ 7 


0.206 


<io- 4 


Roots 


0.211 


<10~ 15 


0.342 


<io- 15 



a Flux values equaling 0 can have confounding biological and computational 
meanings. 

b Spearman's r. 

c Correlations and statistical significance calculated in R. 
d Number of duplication events per gene family divided by the number of species 
in that family. 

for both C3 and C4 model networks, whether or not null- 
flux reactions are included and whether duplications are cal- 
culated per species or per family (table 1). 

Association of Flux and Duplication is neither Taxa nor 
Duplication-Mechanism Specific 

As described, these species share a history of WGD (fig. 1). 
We summed the number of duplications on each branch in 
figure 1 , separating those with lineage-specific WGDs from 
those without. Duplications in both groups are significantly 
and positively correlated with maximum flux (WGD: r = 
0.111, P < 0.05; SSD: r = 0.094, P < 0.05). Of course, 
the branches containing WGDs will also have some back- 
ground level of SSD, meaning that the duplications on these 
branches will not be exclusively due to WGD. However, the 
similarity in correlations seen between the two types of 
branch suggests that a more careful accounting of dupli- 
cates is unlikely to yield different results. Similarly, we found 
significant positive associations of duplication and flux 
for the monocot subtree as well as the eudicot tree with 
A. thaliana removed (P < 0.05). The similarity of the results 
for these subtrees implies that our results are not specific to 
Arabidopsis, even though one of the primary metabolic net- 
works used is from this organism. Among the terminal no- 
des with rice and soybean show significant associations of 
flux and duplication after a Bonferroni multiple-testing cor- 
rection (P < 0.00256). Unfortunately, for the remainder of 
the tip taxa, it is difficult to distinguish between the lack of 
an association and the lack of sufficient numbers of 
duplicates to discern if that association might exist. Similarly, 
the flux values inferred from the sorghum C4 leaves show 
a mixed pattern of associations and lack thereof depending 
on the precise data set used (0.1689 < P < 0.9653). 
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Table 2 

Duplication Status per Gene Family Split by Cellular Compartment 



Duplication 

versus Flux a Duplication 11 



Cellular Compartment 


n 


f 


P 




P 


Nucleus 


56 


0.320 


0.016 


3.481 


0.0005 


Cytosol 


74 


0.133 


0.258 


4.910 


<0.0001 


Chloroplast and plastid 


273 


0.275 


<0.0001 


-0.646 


0.518 


Mitochondria 


134 


0.434 


<0.0001 


1.012 


0.311 


Plasma membrane 


97 


0.135 


0.187 


7.371 


<0.0001 


Endoplasmic reticulum 


44 


0.304 


0.045 


-1.161 


0.246 


Golgi apparatus 


12 


0.401 


0.196 


2.280 


0.023 


Cell wall 


52 


0.343 


0.013 


3.210 


0.001 


Extracellular 


51 


0.089 


0.532 


5.115 


<0.0001 



Bold values are significant at a Bonferroni corrected a = 0.0055. 
a Duplications per gene family versus the maximum flux. 
b Wilcoxon rank test of difference across compartments {positive values: 
overduplication; negative values: underduplication). 

c Spearman's r, calculated in SA5 (v9.2.2, Cary, NC). 
d Wilcoxon's Z, calculated in SAS (v9.2.2, Cary, NC). 

Association of Flux and Duplication Extends Across 
Compartments and Functional Annotations 

Gene families were associated with GO Slim annotations 
(supplementary table 1, Supplementary Material online) for 
both cellular compartment and function. We found signifi- 
cant Spearman's correlations between the flux and duplica- 
tion rate for the metabolic gene families from the chloroplast 
and mitochondria (table 2). Likewise, gene families that have 
a role in DNA or RNA binding or metabolism, hydrolase ac- 
tivity, and responses to stimuli or stress had significant corre- 
lations between the number of duplications and flux (table 3). 

Table 3 

Duplication Status per Gene Family Split by Functional Annotation 



Duplication 

versus Flux a Duplication 10 



Function 


n 




P 




P 


Cell organization and biogenesis 


29 


0.209 


0.274 


1.010 


0.312 


Developmental processes 


20 


-0.132 


0.578 


1.693 


0.090 


DNA or RNA binding or 


26 


0.760 


<0.0001 


-2.284 


0.022 


metabolism 












Electron transport 


7 


0.860 


0.013 


0.460 


0.645 


Hydrolase activity 


114 


0.310 


<0.001 


-1.661 


0.097 


Kinase activity 


62 


-0.031 


0.817 


1.167 


0.243 


Nucleic acid or Nucleotide 


94 


0.062 


0.554 


0.011 


0.991 


binding 












Protein binding or metabolism 


121 


0.139 


0.128 


1.463 


0.143 


Signal transduction 


13 


0.104 


0.735 


2.450 


0.014 


Stimulus or stress response 


199 


0.302 


<0.0001 


2.650 


0.008 


Transferase activity 


166 


0.211 


0.006 


-0.833 


0.405 


Transporters or transport 


56 


-0.034 


0.801 


1.755 


0.079 



Bold values are significant at a Bonferroni corrected ot = 0.0042. 
a Duplications per gene family versus the maximum flux. 
b Wilcoxon rank test of difference across compartments (positive values: 
overduplication; negative values: underduplication). 

c Spearman's r, calculated in SAS (v9.2.2, Cary, NC). 
6 Wilcoxon's I, calculated in SAS (v9.2.2, Cary, NC). 



To determine whether duplication rates differed among 
compartments or classes, we used Wilcoxon rank-sum test 
(Z-scores in tables 2 and 3). Although gene families could 
appear in more than one annotation group, families located 
in the nucleus, cytosol, plasma membrane, cell wall, and 
extracellular space were significantly overduplicated com- 
pared with all other gene families (table 2). No functional 
categories were significantly overduplicated (table 3). 

Selection on Sequence Evolution of Ion Transporters 

We chose to analyze the ion transporters because of their 
interesting role as potential chokepoints. In the metabolic 
networks used in this analysis, the gene families represent- 
ing transporters have significantly higher flux than nontrans- 
porter gene families (Mann-Whitney one-tailed P < 1CT 15 ). 
However, we found no significant correlation between the 
flux and duplicability among transporter gene families (P = 
0.599). Therefore, we chose to look at the fine-scale differ- 
ences in selection in two classes of ion transporters, phos- 
phate and sodium. These elements have distinct roles in the 
growth and development of plants and hence potentially 
differing duplication dynamics. Phosphate transporters im- 
port an essential macronutrient, while sodium transporters 
primarily limit the import of potentially toxic sodium (Rausch 
and Bucher 2002; Kronzucker and Britto 201 1). By narrow- 
ing our focus to just these 5 gene families and limiting 
ourselves to the three species (A. thaliana, P. trichocarpa, 
and C. papaya), it is possible to manually isolate SSD and 
WGD events. This inference in turn allows us to assess if 
the strength of selection differs following WGD, SSD, and 
speciation. 

Phosphate Transporters 

Phosphate transporters in A. thaliana are divided into four 
gene families. These families include the high-affinity trans- 
porters (PHT1 ; Mudge et al. 2002; Poirier and Bucher 2002), 
which import ions across the plasma membrane, and the 
mitochondrial (PHT3; Hamel et al. 2004) and chloroplast 
(PHT4; Guo et al. 2008) transporters, which act in their re- 
spective organelles. Finally, low-affinity (PHT2) phosphate 
transporters are also localized to the chloroplast (Versaw 
and Harrison 2002). We inferred gene phylogenies for 
the four phosphate transporter families and for one sodium 
transporter family (see Materials and Methods). Although 
the topology of phosphate transporter gene families is easily 
reconciled to the species tree, none of the clades contained 
the 4:2:1 ratio of A. thaliana to P. trichocarpa to C. papaya 
genes that would be expected if all transporters had been 
retained following the a, (3, and P. trichocarpaAN GDs and 
no SSDs had been retained (fig. 2a and b). The average 
selective constraint (K a /K s ) for PHT gene families varies con- 
siderably from 0.076 in high-affinity transporters to 0.207 in 
low-affinity transporters (table 4). The lowest KJK S corre- 
sponds to the family with the largest observed number of 
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Table 4 

Selective Constraint Estimated with Three Models of Gene Evolution for Ion Transporters of A. thaliana, C. papaya, and P. trichocarpa 











PHT2-Low- 


PHT3- 














PHT1 


-High- 


Affinity 


Mitochondrial 


PHT4-Chloroplast 










Affinity Phosphate 


Phosphate 


Phosphate 


Phosphate 


NHX-Sodium Ion 






Transporter 


Transporter 


Transporter 


Transporter 


Transporter 


Model 


Branches 


KJK S 


InL 


KJK S InL 


K a /K s InL 


KJK S 


InL 


K a /K s 


InL 


R_Null 


All 


0.076 




0.207 


0.114 


0.148 




0.049 










15379.S 


3577.0 


5898.3 




24869.3 




11210.1 


R_Dupl 


Speciation 


0.063 a 




0.1 56 a 


0.080 a 


0.123 a 




0.062 a 






Duplication 


0.082 a 




0.41 5 a 


0.133 a 


0.249 a 




0.031 a 










15376.7 


3570.2 


5894.1 




24847.0 




11188.5 


R_WGD 


Speciation 


0.063 




b 


0.080 a 


0.123 




0.061 a 






WGD C 


0.081 






0.190 a 


0.233 




0.067 a 






SSD d 


0.085 






0.112 a 


0.265 




0.018 a 










15376.6 




5891.2 




24846.7 




11175.3 



a Bold values indicate a significant improvement over the model immediately above at P < 0.05; nested likelihood ratio test (distributed % 2 , P < 0.05, degrees of freedom = 1). 
b No small scale duplications in PHT2, so model FLDupl is equivalent to model R_WGD. 

c WGD: determined by syntenic paralogy using the Plant Genome Duplication Database (Tang, Bowers, et al. 2008). 
d SSD: determined either by a lack of syntenic paralogy and/or by tandem duplication status. 



duplications (high-affinity transporters: 19 duplications), 
whereas the highest K a /K s values correspond to the family 
with the fewest duplications (low-affinity transporters: 1 du- 
plication). This observation is, however, without statistical 
significance. In all cases, the branches following gene dupli- 
cations show significantly higher KJK S than do those follow- 
ing speciation (table 4; but note that the small size of the 
low-affinity family limits the strength of our conclusion 
for that family). We also investigated selective constraints 
associated with duplication mechanism by dividing the 
branches following duplications into those due to WGD 
and to SSD. Here, the difference in selective constraint is less 
clear: for the high-affinity and chloroplast phosphate trans- 
porters, the KJK S values for whole-genome duplicates are 
not significantly different than those for SSDs. Among 
the mitochondrial transporters, whole-genome duplicates 
have significantly higher KJK S than small-scale duplicates, 
indicating a weaker selective constraint following WGD. 
The counts of WGDs versus SSDs per gene family are sta- 
tistically uninformative (Fisher's Exact test: P = 0.75). 

Sodium Transporters 

The angiosperm sodium ion transporters (NHX) are a single 
gene family responsible for keeping Na + concentrations at 
nontoxic levels (Rodriguez-Rosales et al. 2008). The sodium 
ion transporters have a lower average KJK S than do any of 
the phosphate transporter families (0.049 versus 0.076- 
0.207). Curiously, among these transporters, paralogs have 
significantly lower KJK S values than do orthologs, indicating 
no release in a selective constraint after duplication (table 4). 
Genes duplicated by WGD seem to be under slightly less 
selective constraint than gene orthologs; however, SSDs 
seem to be under considerably higher selective constraint 
than either. 



Discussion 

Selection on Plant Gene Duplications 

Although it has been hypothesized that a substantial frac- 
tion of the surviving duplicate genes in the genomes of mul- 
ticellular eukaryotes might be due to the neutral fixation of 
duplicates (Lynch and Conery 2003), other potential forces 
could also be involved (Kondrashov and Kondrashov 2006; 
Innan and Kondrashov 2010). Here, we have taken both 
a low-level and a high-level approach to look for evidence 
of selection in the process of gene and genome duplications 
in the plants. 

Selection, Sequence Evolution, and Ion Transporters 

Part of our analysis focused on sequence evolution in two 
families of ion transporters. Transporters sometimes appear 
to be the limiting step in metabolic pathways (Brown et al. 
1 998; Pritchard and Kell 2002), a fact that may partly explain 
why their evolution after both SSD and WGD is distinct from 
other metabolic genes (Lin and Li 201 0; Bekaert and Conant 
201 1). Limiting our analysis to single gene families also al- 
lows us to carefully distinguish WGD and SSD events and to 
model the selective constraints acting on these genes. 

There are two primary hypotheses regarding the expected 
changes in selective constraint following gene duplication. 
Predominant and recent neo-functionalization would predict 
KJK S > 1.0 (Zhang et al. 2003; Hahn 2009). On the other 
hand, subfunctionalization (and likely neutral retention by 
drift) would suggest that KJK S is elevated after duplication 
but not above 1.0 (Hughes 1994; Zhang et al. 1998; Force 
et al. 1 999; Lynch and Conery 2000). Importantly, both mod- 
els predict an elevated value of KJK S after duplication; how- 
ever, evidence for such increases is mixed. Hughes and 
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Hughes (1993) found no evidence for the relaxation of se- 
lective constraint among 17 genes in the tetraploid frog 
Xenopus laevis. Kondrashov et al. (2002) found that recent 
paralogs were under significantly lower selective con- 
straints than orthologs, whereas others (Lynch and Conery 
2000; Kondrashov et al. 2002; Zhang et al. 2003; Jordan 
et al. 2004) have found evidence for a decrease in selective 
constraint immediately following duplication. This relaxa- 
tion appears to be temporary; Jordan et al. (2004) found 
that the average strength of purifying selection acting 
on old duplicates was higher than for nonduplicated 
genes. This observation presumably reflects the situation 
after the fate determining mutation, which breaks the se- 
lective symmetry of two duplicates and sends them down 
differing paths (Innan and Kondrashov 2010). Among 
PHTs, our results parallel those of Jordan et al. (2004) in 
finding a general relaxation of selective constraint after 
ion transporter duplication. This result is not supported 
among the NHX transporters. This difference may be 
due to the limited evolutionary paths opened by a duplica- 
tion of sodium transporters compared with that of phos- 
phate transporters (Kronzucker and Britto 201 1). 

We also extended our analysis to differences in constraint 
between SSD- and WGD-produced duplicates. We had no 
a priori hypothesis on which mechanism would impart high- 
er selective constraint, and, in fact, we found both possible 
outcomes. 

Associations Between Duplication Propensity and 
Metabolic Flux 

We also made a large-scale analysis of the patterns of evo- 
lution in the metabolic network. To our knowledge, this 
analysis represents the first high-level phylogenomic-scale 
study of gene duplication and metabolism in angiosperms 
(for studies of metabolism following WGD in other organ- 
isms, see Gout et al. 2009; van Hoek and Hogeweg 2009). 
By focusing on metabolism, we can ask whether duplica- 
tions are randomly distributed across the network (as might 
be expected if drift were the only force at work) or show 
biases in the patterns of fixation. Notably, we find that there 
is a statistically significant relationship between duplication 
propensity and each enzyme's predicted flux. This analysis 
follows our work on absolute and relative dosage among 
the Arabidopsis WGD duplicates (Bekaert et al. 2011), 
where we found that reactions with high flux were enriched 
for enzymes coded by duplicate genes produced by the ancient 
P event (but not the more recent a event). Here, we have 
shown that the relationship between the flux and duplication 
is not specific to Arabidopsis but a more general pattern in 
plants. Although it is certainly not the case that all gene du- 
plications are associated with high-flux reactions (the asso- 
ciation magnitudes found are small), selection for increased 
gene dosage (Kondrashov and Kondrashov 2006; Conant 
and Wolfe 2008) is an attractive explanation for the fixation 



of some of these duplicates. In fact, examples of plant 
duplications apparently fixed by such selection are well 
known (van Hoof et al. 2001; Widholm et al. 2001). 

Because the association of flux with duplication holds for 
both SSD and WGD events, we propose that different types 
of selective environment favor dosage-based duplicates pro- 
duced by the two mechanisms. Thus, SSD may be useful in 
situations where the increased dosage would be beneficial 
at the tips of a pathway or in secondary metabolism: this is 
likely the case for the copper tolerance duplication in blad- 
der campion (Silene vulgaris; van Hoof et al. 2001). How- 
ever, as Kacser and Burns (1981) pointed out, for most 
metabolic pathways, it is unlikely that a single reaction is flux 
limiting, meaning that a single gene duplication is unlikely to 
alter the flux in such a pathway. WGD is a potential route to 
increased flux in such situations, and it appears that such 
selection may have occurred after a WGD in the ancestor 
of bakers' yeast (Saccharomyces cerevisiae; Conant and 
Wolfe 2007; Merico et al. 2007; van Hoek and Hogeweg 
2009). 

Taking these analyses to the subcellular level, we find 
strong correlations between flux and duplication in the mito- 
chondria and chloroplast, but not in the cytosol. This result 
suggests that the general association between flux and du- 
plication is primarily driven by reactions in these compart- 
ments, an unsurprising conclusion given the roles of the 
chloroplast and the mitochondria as the plant cell's anabolic 
and energy-yielding centers. These patterns also accord well 
with our prior analyses of the compartmental evolution in 
the Arabidopsis and human metabolic networks (Bekaert 
et al. 201 1; Hudson and Conant 201 1). 

Gene and Genome Duplication, Selection, and Contingency 

Although a WGD that occurs in a particular individual is 
much less likely to be selectively neutral than an SSD (Vieta 

2005) , it does not follow that there should be strong selec- 
tion at every locus duplicated in such an event. Although it 
might therefore appear that WGD produces a large class of 
duplicate genes that evolve more or less neutrally after 
WGD, this hypothesis is difficult to reconcile with observa- 
tions such as the dosage balance hypothesis. To distinguish 
between these two hypotheses, one might consider what 
the sources of variation in the selective constraint are for 
the set of WGD-produced duplicate genes in a genome. 
In fact, the number of sources of variation in constraint 
among duplicates at large (Duret and Mouchiroud 2000; 
Pal et al. 2003; Drummond et al. 2006; Vitkup et al. 

2006) suggests the importance of "contingency" in dupli- 
cate evolution. In other words, a duplicate's fate will depend 
on both its intrinsic properties (including factors studied 
here, such as function, cellular compartment, and duplica- 
tion mechanism) as well as the environment in which it finds 
itself at birth. 
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Supplementary Material 

Supplementary table 1 is available at Genome Biology and 
Evolution online (http://www.gbe.oxfordjournals.org/). 
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